Foundations of Natural Language Processing

An Introduction to Understanding and Generating Human Language

1. What is Natural Language?

Language is a natural system of communication that humans use to express thoughts, emotions, intentions, and information. Unlike formal systems such as mathematics or programming languages, human language is flexible, ambiguous, and highly dependent on context.

Humans can easily understand incomplete sentences, indirect requests, jokes, or sarcasm. For example, the sentence “Can you open the window?” is usually understood as a request, not a question about physical ability. This natural understanding comes effortlessly to humans but is extremely challenging for computers.

To study and model human language scientifically, the field of linguistics examines how language is structured and used. Linguistics helps us analyze language at different levels, from basic sounds and words to sentence structure and meaning in context.

Human language can be viewed as a layered system, where meaning is built gradually—from individual units like sounds and words, to sentences, and finally to interpretation based on context. Understanding these layers is important because Natural Language Processing attempts to model each of them computationally.

1.1 Levels of Language Structure

1.1.1 Phonemes (Sounds)

  • The smallest unit of sound in a language. Phonemes carry no meaning by themselves; however, exchanging one phoneme for another can change the meaning of a word.

  • Phonemes are particularly important in applications involving speech understanding, such as speech recognition, speech-to-text transcription, and text-to-speech conversion.

    e.g., The sounds /b/ and /p/ in “bat” and “pat” are different phonemes, changing the word’s meaning.
    “Shape” has three phonemes (/sh/, /long-a/, /p/) but five letters, showing that phonemes are about sound, not letters.

1.1.2 Morphemes (Meaningful Units) and Lexemes (Words)

  • The smallest unit of language that carries meaning is called a morpheme. Morphemes are formed from combinations of phonemes.

  • Lexemes are sets of word forms related to one another by meaning. For example, “run” and “running” are forms of the same lexeme.

    Free morphemes: Stand-alone words like “sing,” “water,” or “melon”.
    Bound morphemes: Affixes that attach to words, like the plural “-s” in “cats,” the past tense “-ed” in “walked,” or the prefix “un-” in “unhappy”.
    “Singing” has two morphemes: “sing” + “-ing” (present progressive).


1.1.3 Syntax (Structure or Rules for Word Order)

  • Syntax is the set of rules by which grammatically correct sentences are constructed from words and phrases in a language.

    e.g. “She loves pizza” vs. “Pizza loves she” uses the same words but changes meaning due to word order (syntax).
    The order of adjectives: “a big red ball” sounds natural, while “a red big ball” does not, showing syntactic rules.


1.1.4 Pragmatics (Context or Meaning)

  • Context is how the various parts of a language come together to convey a particular meaning. It includes long-term references, world knowledge, and common sense along with the literal meaning of words and phrases; the meaning of a sentence can change based on its context.


2. Why Is Natural Language So Hard for Computers?

If you give a computer the math problem \(2 + 2\), the answer is always \(4\). If you give a computer a sentence, the “answer” depends on the speaker, the listener, the time of day, and the history of the world.
Here are the four primary reasons language “breaks” machines:

2.1 Ambiguity (The “Double Meaning” Problem)

Ambiguity is the greatest enemy of NLP. A single word or sentence can have multiple valid interpretations.

Lexical Ambiguity: A single word has multiple meanings.

Example: “I went to the bank.” (Was it a river bank or a financial institution?)

Structural Ambiguity: The grammar of the sentence allows for different meanings.

Example: “I saw the man with the telescope.” (Did I use the telescope to see him, or does the man physically have a telescope?)


2.2 Context & World Knowledge

Computers lack “common sense.” They don’t know how the physical world works unless we tell them.

Example:

“The trophy didn’t fit into the brown suitcase because it was too large.”

“The trophy didn’t fit into the brown suitcase because it was too small.”

The Logic: In the first sentence, “it” is the trophy. In the second, “it” is the suitcase. Humans know this because we understand the physical relationship between objects; a computer only sees two identical sentence structures.

2.3 Variability & Informality

Unlike a programming language (like Python), human language has no “syntax error” that stops it from working. Humans are messy communicators.

e.g., Slang & Emojis: “That’s 🔥” means something very different than “There is a fire.”

Code-Mixing: In many cultures, people switch between languages in a single sentence (e.g., “Hinglish,” a blend of Hindi and English).

Typos: A computer might treat “hello” and “helo” as two completely different entities, while a human doesn’t even blink.
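One standard way to make a machine tolerant of typos is edit distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. A minimal sketch (the function name is ours, not from any particular library):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming: the fewest
    single-character insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("hello", "helo"))  # 1
```

With a distance of 1, a spell-checker can treat “helo” as a likely misspelling of “hello” rather than a brand-new word.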

3. What Is Natural Language Processing (NLP)?

Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to process, analyze, and generate human language.

3.1 Where does NLP sit?

NLP is not just “coding for text.” It is an interdisciplinary field that sits at the intersection of three major areas:

  • Computer Science: Providing the algorithms and computational power.

  • Artificial Intelligence: Providing the “learning” capability to improve over time.

  • Linguistics: Providing the rules and structure of how language actually works.

3.2 Key Distinction: Processing vs. Understanding

It is important to remember that computers do not “understand” language the way you do.

  • Humans understand language through consciousness, sensory experience, and emotion.

  • Computers process language through pattern recognition and statistical probability.

When an AI responds to you, it isn’t “thinking” about your feelings; it is calculating which words are most likely to follow your question based on billions of examples it has seen before.

3.3 NLP in action

We interact with NLP dozens of times a day without realizing it. For example:

  • Information Retrieval (Search Engines): When you type a typo into Google and it says “Showing results for…”, NLP is analyzing the intent behind your misspelled words.

  • Text Classification (Spam Filters): Your email provider uses NLP to “read” incoming mail and decide if the patterns of words look like a legitimate message or a fraudulent scam.

  • Machine Translation: Tools like Google Translate use NLP to map the structure of one language (e.g., French) onto the structure of another (e.g., English) while trying to preserve the original meaning.

3.4 The Two Pillars of NLP: NLU and NLG

To truly “process” language, an NLP system usually needs to perform two distinct types of operations: one for input and one for output.

3.4.1 NLU (Natural Language Understanding)

The Goal: Taking “messy” human language and turning it into a structured format a machine can use.

The Process: This involves figuring out the intent (what does the user want?) and the entities (what specific things are they talking about?).

Example: When you say to a smart speaker, “Set an alarm for 7 AM,” NLU is the part that identifies the action (Set_Alarm) and the time (07:00).
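The alarm example can be sketched as a toy NLU parser: a regular expression pulls out the intent and the time entity. The pattern and the `Set_Alarm` label mirror the example above; everything else (function name, output format) is illustrative, not a real assistant's API:

```python
import re

def parse_alarm_command(utterance: str):
    """Toy NLU: map 'Set an alarm for 7 AM' to an intent plus a
    normalized 24-hour time entity. Pattern is illustrative only."""
    m = re.search(r"set an alarm for (\d{1,2})\s*(am|pm)", utterance, re.IGNORECASE)
    if not m:
        return None  # no recognized intent
    hour, meridiem = int(m.group(1)), m.group(2).lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    if meridiem == "am" and hour == 12:
        hour = 0
    return {"intent": "Set_Alarm", "time": f"{hour:02d}:00"}

print(parse_alarm_command("Set an alarm for 7 AM"))
# {'intent': 'Set_Alarm', 'time': '07:00'}
```

Real assistants use trained models rather than one regex, but the output shape (intent + entities) is the same idea.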


3.4.2 NLG (Natural Language Generation)

The Goal: Taking structured data from a machine and turning it into a “natural” human sentence.

The Process: This involves text planning, sentence realization, and ensuring the output is grammatically correct and helpful.

Example: After you set that alarm, the smart speaker doesn’t just show a code; it says, “Okay, I’ve set your alarm for 7 AM.” That spoken sentence was created by NLG.

4. Why Do We Need NLP?

We don’t study NLP just because it’s a “hot topic” in AI. We study it because we have reached a point in human history where we are producing more language than we can possibly consume.

4.1 The Explosion of Unstructured Text Data

In computer science, data is often divided into structured (tables/databases) and unstructured (free-form text). Approximately 80% of all enterprise data is unstructured.

Every minute of every day:

  • Millions of emails are sent.

  • Thousands of legal documents and medical records are filed.

  • Global social media platforms generate billions of words in logs and posts.

Without NLP, this data is just “noise”—static that takes up storage space but provides no value.

4.2 Humans Cannot Process Text at Scale

Imagine a large e-commerce company receives 50,000 customer reviews a day. If they hired humans to read every single one to find “complaints about shipping”:

  • It would be too slow: By the time the report is finished, the problem is weeks old.

  • It would be too expensive: The labor cost would be astronomical.

  • It would be inconsistent: Ten different humans would categorize the “tone” of a review in ten different ways.

NLP is the only solution for consistency and speed at scale.

4.3 The Need for Automation and Insight

Organizations need to turn “words” into “decisions.” NLP provides the tools to do this automatically:

  • Automated Classification: A hospital’s system can “read” an incoming patient’s history and instantly route them to the correct specialist based on key symptoms mentioned in the text.

  • Efficient Summarization: A financial analyst needs to know what 50 different news articles say about a specific stock. An NLP model can summarize those 50 articles into a single paragraph of key takeaways.

  • Insight Discovery (Trend Analysis): A brand can monitor social media globally to see if people are becoming more or less frustrated with their product in real-time.

5. What NLP Can and Cannot Do

One of the biggest misconceptions in AI is that a computer “reads” like a human. To be a successful NLP practitioner, we must understand the boundary between pattern matching and genuine understanding.

5.1 NLP Works Statistically, Not Logically

When a model (even a powerful one like ChatGPT) answers a question, it isn’t using a set of logical “facts” stored in a brain. Instead, it is predicting the next most likely word based on statistical probability.

  • Example: If you ask a model “What is the capital of France?”, it doesn’t “know” Paris exists. It simply knows that in its billions of training examples, the words “capital,” “France,” and “Paris” appear together with extremely high frequency.

5.2 Patterns vs. Meaning

NLP systems excel at identifying syntax (structure) but struggle with semantics (true meaning).

  • The Pattern: A system can easily identify that “The cat sat on the mat” is a valid sentence.

  • The Meaning: The system doesn’t know what a “cat” feels like, what “sitting” is, or the physical concept of a “mat.” It only knows these tokens relate to each other mathematically.

5.3 Where NLP Often Fails

Because NLP is built on patterns found in “training data,” it can fail spectacularly when it encounters something that doesn’t fit the pattern:

  • Sarcasm & Irony: Since sarcasm relies on the speaker saying the opposite of what they mean, statistical models often flag negative sarcastic comments as “highly positive.”

  • Jokes & Wordplay: Humor often relies on breaking linguistic rules or using cultural “inside knowledge” that isn’t explicitly written in text.

  • Cultural Nuance: Phrases that are polite in one culture might be flagged as aggressive in another if the model was only trained on a specific dialect.
    e.g., A head shake can mean “yes” or “no” depending on the culture; it is not always the Western “no.”

5.4 The “Out-of-Distribution” Trap

If we train an NLP model on thousands of Shakespearean plays and then ask it to analyze modern Twitter (X) slang, it will fail. A model is only as good as the data it has seen. It cannot “reason” its way through a new type of language it wasn’t taught.

6. Common NLP Tasks

6.1 Text/Document Classification

Text classification is the process of assigning predefined categories or labels to text documents based on their content. It involves analyzing the text and determining which category it belongs to, often using machine learning algorithms.
e.g., Spam detection in emails classifies messages as “spam” or “not spam.” Another example is categorizing news articles into topics like “sports,” “politics,” or “technology.”


6.2 Sentiment Analysis

Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. It involves analyzing the text and determining whether it is positive, negative, or neutral. This can be useful for understanding customer feedback, social media sentiment, or market trends.
e.g., A company can use sentiment analysis to analyze customer reviews and identify areas for improvement.
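A bare-bones way to see how sentiment analysis works is a lexicon-based scorer: count positive words, subtract negative words. The word lists below are made up for illustration; real systems learn weights from labeled data:

```python
# Minimal lexicon-based sentiment scorer. The word sets are illustrative,
# not a standard sentiment lexicon.
POSITIVE = {"great", "love", "excellent", "fast", "happy"}
NEGATIVE = {"bad", "slow", "terrible", "hate", "broken"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this excellent product"))  # positive
print(sentiment("terrible and slow delivery"))     # negative
```

Note how brittle this is: “not great” scores as positive, which is exactly why the field moved to statistical and neural models.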


6.3 Information Extraction

Information extraction is the process of extracting structured information from unstructured text. It involves identifying and extracting specific pieces of information from a document, such as names, dates, locations, and other entities. This can be useful for tasks like named entity recognition, where the goal is to identify and classify named entities in a text.
e.g., A company can use information extraction to extract relevant information from news articles, such as the names of people mentioned in the article.


6.4 Part-of-Speech Tagging

Part-of-speech (POS) tagging is the process of assigning grammatical tags to each word in a sentence based on its role in the sentence. This can be useful for tasks like syntactic parsing, where the goal is to analyze the structure of a sentence.


6.5 Language detection and machine translation

Language detection is the process of identifying the language in which a piece of text is written.

Machine translation is the process of automatically translating text from one language to another using computer algorithms and models.
e.g., Google Translate uses machine translation to translate text between different languages.

6.6 Knowledge Graph and QA systems

A knowledge graph (KG) is a structured network of entities (nodes) and their relationships (edges) that enables machines to understand context and meaning, making it a critical component in modern question-answering (QA) systems.
e.g., When asked, “Where was the painter of the Mona Lisa born?”, the KG links “Mona Lisa” → “painted by” → “Leonardo da Vinci” → “born in” → “Italy” to derive the answer.

Diffbot maintains one of the largest public-web knowledge graphs, at times marketing itself as having a graph “500x larger than Google’s” for certain web-crawled categories. Scale: over 1 trillion facts and roughly 10 billion connected entities, including people, companies, products, and articles.

6.7 Text parsing

Text parsing, also known as syntactic analysis, is the process of analyzing text to understand its structure and meaning based on grammatical rules, separating it into smaller components for further processing.

7. Phases of Natural Language Processing

NLP involves a series of phases that work together to process language, and each phase helps in understanding the structure and meaning of human language.
To a computer, a sentence is like a puzzle with five distinct layers of meaning. We call this the Linguistic Hierarchy.

7.1 Lexical and Morphological Analysis

(Understanding words before understanding sentences)

Before a machine can understand a sentence, it must first understand what the words are and how those words are formed.

7.1.1 Lexical Analysis

Lexical analysis is about finding words in text and labeling them.

Think of it like this:
Before reading a book, you first see the words, then recognize what kind of words they are.

* Tokenization:

Tokenization means splitting a sentence into words or meaningful pieces.

Example:
Sentence:

“I love programming”

After tokenization:

[“I”, “love”, “programming”]

This step is necessary because machines don’t naturally see words—only characters.
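The tokenization step above can be sketched in a few lines. A regex over word characters is enough for the example sentence; real tokenizers also handle contractions, hyphens, and Unicode:

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens using a simple regex.
    \w+ matches runs of letters, digits, and underscores."""
    return re.findall(r"\w+", sentence)

print(tokenize("I love programming"))  # ['I', 'love', 'programming']
```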

* Part-of-Speech (POS) Tagging:

After identifying words, the system asks: What role is each word playing?

Example:
Sentence:

“I am reading a book.”

Tokens with roles:

  • “I” → Pronoun
  • “am” → Verb
  • “reading” → Verb
  • “a” → Article
  • “book” → Noun

This helps the system understand who is doing what in a sentence.
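To make the tagging step concrete, here is a toy POS tagger backed by a tiny hand-written lexicon covering just the example sentence. Production taggers (NLTK, spaCy) learn tags statistically from annotated corpora; the lexicon and tag names here are ours:

```python
# Toy POS tagger: look each token up in a hand-built lexicon.
LEXICON = {
    "i": "PRONOUN", "am": "VERB", "reading": "VERB",
    "a": "ARTICLE", "book": "NOUN",
}

def pos_tag(tokens):
    """Return (token, tag) pairs; unknown words get 'UNKNOWN'."""
    return [(tok, LEXICON.get(tok.lower(), "UNKNOWN")) for tok in tokens]

print(pos_tag(["I", "am", "reading", "a", "book"]))
# [('I', 'PRONOUN'), ('am', 'VERB'), ('reading', 'VERB'), ('a', 'ARTICLE'), ('book', 'NOUN')]
```

The lookup approach fails as soon as a word is ambiguous (“book” can be a verb), which is why real taggers consider the surrounding words.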

Why Lexical Analysis Matters

  • It helps machines recognize individual words
  • It simplifies text so higher-level processing becomes possible
  • Almost every NLP task starts here

7.1.2 Morphological Analysis

Morphological analysis goes inside words.

Humans know that:

  • “run”, “running”, and “ran” are related.
    Machines don’t know this unless we teach them.

This level studies morphemes, the smallest meaningful parts of a word.

Examples:

  • “unhappy” → “un” + “happy”
  • “playing” → “play” + “ing”

* Stemming

Stemming cuts words down to a basic form.

Examples:

  • “running” → “run”
  • “connected” → “connect”

It’s fast, but sometimes rough.

* Lemmatization

Lemmatization is smarter and uses grammar.

Examples:

  • “better” → “good”
  • “mice” → “mouse”
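Both ideas can be sketched in plain Python: a naive suffix-stripping stemmer, and a lemmatizer that falls back on a small irregular-forms table. Real tools (the Porter stemmer, the WordNet lemmatizer) are far more thorough; the suffix list and table below are illustrative only:

```python
def stem(word: str) -> str:
    """Naive stemmer: strip a known suffix if enough of the word remains.
    Fast but rough, just like the text says."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny table of irregular forms; real lemmatizers use full dictionaries.
IRREGULAR = {"better": "good", "mice": "mouse", "ran": "run"}

def lemmatize(word: str) -> str:
    """Dictionary lookup first, then fall back to stemming."""
    return IRREGULAR.get(word, stem(word))

print(stem("running"))      # run
print(stem("connected"))    # connect
print(lemmatize("better"))  # good
print(lemmatize("mice"))    # mouse
```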

Why Morphological Analysis Matters

  • It reduces word variation
  • Improves accuracy in search, translation, and tagging
  • Helps machines treat related words as related ideas

7.2 Syntactic Analysis (Parsing)

(Understanding sentence structure)

Once words are known, the machine asks:
How are these words arranged?

Syntactic analysis checks grammar and structure.

Humans instantly know this sentence is wrong:

“Apple eats John an.”

A machine needs rules to figure that out.

What Happens in Syntactic Analysis

  • Identifies subject, verb, and object
  • Builds a parse tree (a structure of the sentence)
  • Ensures the sentence follows grammar rules

Example:

  • Correct: “John eats an apple.”
  • Incorrect: “Apple eats John an.”

Same words. Different meaning.
Order matters.

Why Syntactic Analysis Matters

  • Helps machines understand relationships between words
  • Essential for translation and text generation
  • Reduces confusion caused by bad grammar

7.3 Semantic Analysis

(Understanding meaning)

A sentence can be grammatically correct and still make no sense.

Example:

“Apple eats John.”

Grammar is fine. Meaning is nonsense.

Semantic analysis checks:
Does this sentence make logical sense?

* Named Entity Recognition (NER)

NER identifies important real-world entities.

Example:

“Tata announced a new car in Delhi.”

Identified entities:

  • “Tata” → Organization
  • “Delhi” → Location

This adds meaning and context.
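A first approximation of NER is a gazetteer: a hand-built list mapping known names to entity types. Modern NER is statistical, but gazetteers remain a common baseline; the entries below just cover the example sentence:

```python
# Gazetteer-based NER sketch: look entities up in hand-built lists.
GAZETTEER = {
    "Tata": "Organization",
    "Delhi": "Location",
}

def find_entities(text: str):
    """Return (entity, type) pairs for every known name in the text."""
    words = text.replace(".", "").split()
    return [(w, GAZETTEER[w]) for w in words if w in GAZETTEER]

print(find_entities("Tata announced a new car in Delhi."))
# [('Tata', 'Organization'), ('Delhi', 'Location')]
```

The weakness is obvious: any name not in the list is invisible, and multi-word names (“New Delhi”) need extra handling, which is why learned models dominate here.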

* Word Sense Disambiguation (WSD)

Some words have multiple meanings.

Example:

“I sat by the bank.”

Does “bank” mean:

  • a financial institution?
  • or the side of a river?

Context decides. Semantic analysis handles this.
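The classic simple approach here is Lesk-style disambiguation: give each sense a signature of related words and pick the sense whose signature overlaps most with the sentence. The sense names and signature sets below are hand-written for illustration:

```python
# Simplified Lesk-style word sense disambiguation for "bank".
SENSES = {
    "bank_financial": {"money", "deposit", "loan", "account", "teller"},
    "bank_river": {"river", "water", "shore", "fishing", "sat"},
}

def disambiguate(sentence: str) -> str:
    """Choose the sense whose signature shares the most words
    with the sentence's context."""
    context = set(sentence.lower().replace(".", "").split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("I sat by the bank of the river."))  # bank_river
print(disambiguate("I deposit money at the bank."))     # bank_financial
```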

Why Semantic Analysis Matters

  • Prevents logical errors
  • Improves question answering and search
  • Helps machines understand what is being talked about

7.4 Discourse Integration

(Understanding multiple sentences together)

Humans don’t read sentences in isolation. Machines must learn this too.

Discourse integration connects sentences to each other.

* Anaphora Resolution

This is about pronouns and references.

Example:

“Taylor went to the store. She bought groceries.”

“She” clearly refers to “Taylor”.

Without discourse understanding, a machine might get confused.

* Contextual Meaning

Some sentences make no sense alone.

Example:

“This is unfair!”

What is “this”?
We need earlier sentences to understand it.

Why Discourse Integration Matters

  • Maintains consistency across paragraphs
  • Critical for summarization and chatbots
  • Helps machines follow long conversations

7.5 Pragmatic Analysis

(Understanding intention, not just words)

Pragmatic analysis goes beyond literal meaning.

Humans rarely say exactly what they mean.

Example:

“Can you pass the salt?”

This is not a question about ability.
It’s a polite request.

* Non-Literal Language

Example:

“I’m falling for you.”

It doesn’t involve gravity.
It means affection.

Why Pragmatic Analysis Matters

  • Helps detect sarcasm, emotion, and intent
  • Crucial for chatbots and conversational AI
  • Makes responses feel natural and human

NLP works in layers:

  • First: What are the words?
  • Then: How are they arranged?
  • Then: What do they mean?
  • Then: How do sentences connect?
  • Finally: What does the speaker really want?

This layered understanding is what turns raw text into meaningful intelligence.

8. How NLP Evolved Over Time


Instead of just looking at it as a list of dates, think of it as a shift in philosophy: We moved from trying to teach the computer rules to letting the computer observe patterns for itself.

Here is the detailed breakdown of the three major eras of NLP.

If we want to build a machine that understands the sentence “The cat sat on the mat,” there are three historical ways you could go about it.

Era 1: The Rule-Based (Heuristic) Era (1950s – 1990s)

The Philosophy: “If we can write down all the rules of grammar, the computer will understand language.” In this era, linguists sat down and wrote thousands of complex “If-Then” rules. It was like building a massive digital dictionary and grammar book combined.

  • The Approach: Regular Expressions (Regex), WordNet (a dictionary-like database) and hard-coded grammars.
  • The Example: To find a date, we’d write a rule: “Look for 2 digits, then a slash, then 2 digits, then a slash, then 4 digits.” or If the sentence contains “Hi” or “Hello,” classify it as a “Greeting.”
  • Pros: Very fast; easy to debug; doesn’t need data.
  • The Failure: It couldn’t handle the “messiness” of real life. If a user wrote “Jan 1st, 99,” the rule-based system would break because it wasn’t expecting that specific format.
  • Insight: This era was great for logic but terrible for scale. We can’t write a rule for every possible way a human might say “hello.”
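The two rules described above look like this in code: a rigid regex for dates and a keyword check for greetings. Note how the date rule silently fails on “Jan 1st, 99”, exactly the brittleness the text describes:

```python
import re

# Era-1 style rule: "2 digits, slash, 2 digits, slash, 4 digits".
date_rule = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def classify(sentence: str) -> str:
    """Keyword rule: any sentence containing 'hi' or 'hello' is a Greeting."""
    words = sentence.lower().split()
    if "hi" in words or "hello" in words:
        return "Greeting"
    return "Other"

print(bool(date_rule.search("Meeting on 01/05/2024")))  # True
print(bool(date_rule.search("Meeting on Jan 1st, 99")))  # False - the rule breaks
print(classify("Hello there"))                           # Greeting
```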

Era 2: The Machine Learning / Statistical Era (1990s – 2014)

The Philosophy: “Don’t tell the computer the rules. Give it a billion sentences and let it count which words usually appear together.” This era moved away from linguistics and toward probability. Instead of saying “A noun follows an adjective,” the computer would say, “I have seen 10,000 cases where ‘Green’ is followed by ‘Apple,’ so there is a 90% chance the next word is a noun.”

  • The Approach: Naive Bayes, Support Vector Machines (SVM), Logistic Regression, Hidden Markov Models (HMM) and N-grams.
  • The Example: Our phone’s “Autocorrect” or “Predictive Text.” It doesn’t know what you are thinking; it just knows that after you type “How are,” the most statistically likely next word is “you.”
  • The Failure: It lacked “long-term memory.” A statistical model might understand a 3-word phrase but forget the beginning of the sentence by the time it reached the end. These models often treat a sentence like a “bag of words”—they know the words exist, but they forget the order (e.g., “Dog bites man” and “Man bites dog” look the same to them).
  • Insight: This era gave us the first functional Google Translate and Spam filters. It worked at scale, but it didn’t understand nuance.
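The predictive-text idea is easy to demonstrate: count which word follows which in a corpus, then predict the most frequent follower. The three-line corpus here is made up to keep the example self-contained:

```python
from collections import Counter, defaultdict

# Toy corpus for counting word pairs (bigrams).
corpus = "how are you . how are things . how are you doing".split()

bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1  # count how often w2 follows w1

def predict_next(word: str) -> str:
    """Era-2 prediction: return the most statistically likely next word."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("are"))  # you  ("you" follows "are" twice, "things" once)
```

This model has no idea what “you” means; it has simply counted that “you” follows “are” more often than anything else.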

Era 3: The Deep Learning / Neural Era (2014 – Present)

The Philosophy: “Build a ‘brain’ (Neural Network) that can represent words as mathematical vectors in a multi-dimensional space.” This is the era of “Representations.” Instead of counting words, we turn words into numbers (vectors). Words with similar meanings are placed close together in a mathematical “map.”

  • The Approach: Recurrent Neural Networks (RNNs), Transformers (BERT, GPT) and embeddings (Word2Vec, GloVe).
  • The Example: ChatGPT or Google’s “Search Generative Experience.”
  • Why it’s better:
    • It understands Context.
    • In the Statistical era, “Bank” was just a word.
    • In the Neural era, the model looks at the surrounding words. If it sees “water” and “flow,” it mathematically shifts the meaning of “Bank” toward the “River” side of its map.
    • Sequence Awareness: It understands that the meaning of a word depends on the words that came before and after it.
    • No Feature Engineering: You don’t have to tell the model what to look for; it discovers the patterns itself.
  • Insight: We stopped trying to program the machine and started training the machine.
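The “mathematical map” intuition can be shown with toy word vectors and cosine similarity. Real embeddings (Word2Vec, GloVe) have hundreds of learned dimensions; the 3-dimensional vectors below are invented purely to illustrate that related words sit close together:

```python
import math

# Made-up 3-D "embeddings": river and water point the same way, money doesn't.
VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "money": [0.0, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(VECTORS["river"], VECTORS["water"]) >
      cosine(VECTORS["river"], VECTORS["money"]))  # True
```

Because “river” and “water” point in nearly the same direction, a neural model seeing “water” nearby can shift its reading of “bank” toward the river sense.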

9. The NLP Pipeline: Building End-to-End Systems

An NLP Pipeline is the set of steps a data scientist follows to build a language-based application. While we often focus on the “Model,” the model is actually only a small part of the entire engineering effort.

9.1 Data Acquisition

Before we can analyze language, we must find it. This involves:

  • Internal Data: SQL databases, customer logs, or emails.
  • External Data: Web scraping, using APIs (like Twitter/X), or public datasets (Kaggle).
  • Data Augmentation: Creating “synthetic” data if you don’t have enough.

---
title: "9.1 NLP Data Acquisition Strategy"
---
%%{init: {"flowchart": {"htmlLabels": true}}}%%
flowchart LR

    A{"<b>How to Acquire Data?</b>"} 
    
    A --> B["<b>1. Internal (Available)</b>"]
    A --> C["<b>2. External (Sourcing)</b>"]
    A --> D["<b>3. Non-Existent</b>"]

    %% Path 1: Internal
    B --> B1["Direct Access (CSV, SQL)"]
    B --> B2["Data Warehouse (Requires DE)"]
    B --> B3["Insufficient Data"]
    
    B3 --> BA["<b>Data Augmentation</b>"]
    BA --> BA1["Synonym Replacement"] & BA2["Bigram Flip"] & BA3["Back-Translation"] & BA4["Adding Noise"]

    %% Path 2: External
    C --> C1["Public Datasets (Kaggle, UCI)"]
    C --> C2["Web Scraping (BeautifulSoup or Selenium)"]
    C --> C3["APIs (Twitter, Reddit)"]
    C --> C4["Unstructured (PDF, Image, Audio)"]

    %% Path 3: Non-Existent
    D --> D1["Manual Labeling (LabelStudio, Prodigy)"]
    D --> D2["Surveys / Crowd-sourcing"]
style A color:#000,fill:#FFF9C4,stroke:#000

9.2 Text Preparation (The “Cleaning Lab”)

This is often the most time-consuming step (60-70% of the project). It is divided into three “strengths”:

  1. Text Cleanup: Fixing OCR errors, stripping HTML tags, handling emojis, and correcting spelling.
  2. Basic Preprocessing: Removing punctuation, converting to lowercase, removing stop words, and Tokenization (breaking sentences into words).
  3. Advanced Preprocessing: * POS Tagging: Identifying nouns vs. verbs.
  • Constituency/Dependency Parsing: Understanding the sentence structure.
  • Coreference Resolution: Figuring out that “He” in sentence 2 refers to “Ram” in sentence 1.
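Steps A and B above can be sketched as one cleaning function using only the standard library. Each step is optional in practice (e.g., case sometimes carries signal), and the function name is ours:

```python
import re
import string

def clean(text: str) -> str:
    """Basic cleanup + preprocessing: strip HTML tags, lowercase,
    drop punctuation, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = text.lower()                   # standardize case
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())         # collapse runs of whitespace

print(clean("<p>Great product!!! Visit <b>our</b> site.</p>"))
# great product visit our site
```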

---
title: "9.2 Text Preparation & Preprocessing"
---
%%{init: {"flowchart": {"htmlLabels": true}}}%%
flowchart LR

    A{"<b>Text Preparation</b>"} --> B["<b>Step A: Cleanup</b><br/>(Noise Removal)"]
    A --> C["<b>Step B: Basic Preprocessing</b><br/>(Standardization)"]
    A --> D["<b>Step C: Advanced Preprocessing</b><br/>(Enrichment)"]

    subgraph Cleaning[" "]
        B --> B1["HTML Tag Stripping"]
        B --> B2["Emoji/URL Removal"]
        B --> B3["Spelling Correction"]
    end

    subgraph Basic[" "]
        C --> C_Must["<b>Must Do</b>"]
        C --> C_Opt["<b>Optional / Contextual</b>"]
        
        C_Must --> CM1["Tokenization<br/>(Word & Sentence)"]
        
        C_Opt --> CO1["Lowercasing"]
        C_Opt --> CO2["Stopword Removal"]
        C_Opt --> CO3["Punctuation & Digit Removal"]
        C_Opt --> CO4["Normalization<br/>(Stemming & Lemmatization)"]
    end

    subgraph Advanced[" "]
        D --> D1["POS Tagging"]
        D --> D2["Dependency Parsing"]
        D --> D3["Coreference Resolution"]
    end

    style A color:#000,fill:#FFF9C4,stroke:#000

9.3 Feature Engineering (Text to Numbers)

Since machines cannot calculate “words,” we must transform them into a mathematical format.

  • Classical Methods: One-Hot Encoding, Bag-of-Words (BoW), TF-IDF.
  • Modern Methods: Word Embeddings (Word2Vec, GloVe).
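TF-IDF, the workhorse of the classical methods, fits in a few lines from scratch. Libraries like scikit-learn add smoothing and normalization; this sketch keeps only the core formula tf × idf on a made-up three-document corpus:

```python
import math

# Toy corpus: three pre-tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """tf = term frequency in this doc; idf = log(N / doc frequency).
    Words appearing in every document score 0 (uninformative)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)
    idf = math.log(len(corpus) / df)
    return tf * idf

print(tf_idf("the", docs[1], docs))  # 0.0 -- 'the' is everywhere
print(tf_idf("dog", docs[1], docs))  # positive -- 'dog' is distinctive
```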

9.4 Modeling & Evaluation

  • Model Building: Choosing an algorithm (e.g., Naive Bayes for speed or a Transformer for accuracy).
  • Evaluation: Using metrics like Precision, Recall, and F1-Score. In NLP, we also use specific metrics like BLEU (for translation) or ROUGE (for summarization).
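The three classification metrics reduce to counting true positives, false positives, and false negatives. A minimal sketch for a single positive class (the spam/ham labels are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision = tp/(tp+fp), recall = tp/(tp+fn),
    F1 = harmonic mean of the two."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "ham"]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)
```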

9.5 Deployment & Monitoring

A model is “living” software.

  • Deployment: Packaging the model into an API (using Flask or FastAPI) so a website or app can use it.
  • Monitoring: Watching for “Model Drift”—as language changes (new slang, new topics), the model’s accuracy will naturally drop over time.
  • Update: Collecting new data to retrain and improve the model.

---
title: "4. Modeling & Evaluation"
---
%%{init: {"flowchart": {"htmlLabels": true}}}%%
flowchart LR

    A{"<b>Modeling Strategy</b>"} --> B["<b>Heuristic Approach</b><br/>(Rules / Regex)<br/><i>Use if: Very little data</i>"]
    A --> C["<b>Machine Learning</b><br/>(NB, SVM, Random Forest)<br/><i>Use if: Moderate data</i>"]
    A --> D["<b>Deep Learning</b><br/>(Transformers, LSTMs)<br/><i>Use if: Massive data + GPU</i>"]
    A --> E["<b>Cloud / LLM APIs</b><br/>(GPT-4, Claude, Gemini)<br/><i>Use if: Unlimited budget</i>"]

    B & C & D & E --> F{"<b>Evaluation</b>"}
    
    subgraph Metrics["How to check if it works?"]
        F --> F1["Intrinsic: Accuracy, F1-Score, Perplexity"]
        F --> F2["Extrinsic: Business Impact (e.g., Sales increase)"]
    end

    style A color:#000,fill:#FFF9C4,stroke:#000

Important Notes:

  1. The Non-Linear Reality: We don’t just go 9.1 to 9.5. If our Evaluation shows poor results, we might realize our Text Preparation was too aggressive, and we’ll go back to step 9.2.
  2. ML vs. DL: In Deep Learning, steps 9.2 and 9.3 are often “collapsed.” The neural network learns the features (step 9.3) automatically from the raw text.
  3. Not “One Size Fits All”: A pipeline for a Chatbot requires a “Dialog Management” component that a Sentiment Analysis pipeline doesn’t need.